Beautiful Soup Documentation

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

These instructions illustrate all major features of Beautiful Soup 4, with examples. I show you what the library is good for, how it works, how to use it, how to make it do what you want, and what to do when it violates your expectations.

This document covers Beautiful Soup version 4.8.1. The examples in this documentation should work the same way in Python 2.7 and Python 3.2.

You might be looking for the documentation for Beautiful Soup 3. If so, you should know that Beautiful Soup 3 is no longer being developed and that support for it will be dropped on or after December 31, 2020. If you want to learn about the differences between Beautiful Soup 3 and Beautiful Soup 4, see Porting code to BS4.

This documentation has been translated into other languages by Beautiful Soup users:

这篇文档当然还有中文版. (Chinese)
このページは日本語で利用できます(外部リンク) (Japanese, external link)
이 문서는 한국어 번역도 가능합니다. (Korean)
Este documento também está disponível em Português do Brasil. (Brazilian Portuguese)

Getting help¶

If you have questions about Beautiful Soup, or run into problems, send mail to the discussion group. If your problem involves parsing an HTML document, be sure to mention what the diagnose() function says about that document.

Quick Start¶

Here’s an HTML document I’ll be using as an example throughout this document. It’s part of a story from Alice in Wonderland:

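html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""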

Running the “three sisters” document through Beautiful Soup gives us a BeautifulSoup object, which represents the document as a nested data structure:

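from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.prettify())
# <html>
#  <head>
#   <title>
#    The Dormouse's story
#   </title>
#  </head>
#  <body>
#   <p class="title">
#    <b>
#     The Dormouse's story
#    </b>
#   </p>
#   <p class="story">
#    Once upon a time there were three little sisters; and their names were
#    <a class="sister" href="http://example.com/elsie" id="link1">
#     Elsie
#    </a>
#    ,
#    <a class="sister" href="http://example.com/lacie" id="link2">
#     Lacie
#    </a>
#    and
#    <a class="sister" href="http://example.com/tillie" id="link3">
#     Tillie
#    </a>
#    ; and they lived at the bottom of a well.
#   </p>
#   <p class="story">
#    ...
#   </p>
#  </body>
# </html>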

Here are some simple ways to navigate that data structure:

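soup.title
# <title>The Dormouse's story</title>

soup.title.name
# u'title'

soup.title.string
# u'The Dormouse's story'

soup.title.parent.name
# u'head'

soup.p
# <p class="title"><b>The Dormouse's story</b></p>

soup.p['class']
# [u'title']

soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>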

One common task is extracting all the URLs found within a page’s <a> tags:

for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie

Another common task is extracting all the text from a page:

print(soup.get_text())
# The Dormouse's story
#
# The Dormouse's story
#
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
#
# ...

Does this look like what you need? If so, read on.
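Here’s a minimal sketch that combines the two tasks above: it collects each link’s text together with its URL. The shortened markup and the helper name link_table are my own inventions for illustration, not part of Beautiful Soup, and I pass 'html.parser' explicitly so the sketch doesn’t depend on which parsers you have installed.

```python
from bs4 import BeautifulSoup

# A shortened stand-in for the "three sisters" document (illustrative only).
html = """
<p class="story">
<a href="http://example.com/elsie" id="link1">Elsie</a>,
<a href="http://example.com/lacie" id="link2">Lacie</a> and
<a id="link3">Tillie</a>
</p>
"""

soup = BeautifulSoup(html, "html.parser")


def link_table(soup):
    """Map each link's text to its href (hypothetical helper).

    Tag.get() returns None instead of raising KeyError when the
    attribute is missing, so links without an href are kept.
    """
    return {a.get_text(strip=True): a.get("href") for a in soup.find_all("a")}


links = link_table(soup)
print(links)
```

Using get('href') rather than ['href'] is a deliberate choice here: real pages often contain anchors with no href at all.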

Installing Beautiful Soup¶

If you’re using a recent version of Debian or Ubuntu Linux, you can install Beautiful Soup with the system package manager:

$ apt-get install python-bs4 (for Python 2)

$ apt-get install python3-bs4 (for Python 3)

Beautiful Soup 4 is published through PyPI, so if you can’t install it with the system packager, you can install it with easy_install or pip. The package name is beautifulsoup4, and the same package works on Python 2 and Python 3. Make sure you use the right version of pip or easy_install for your Python version (these may be named pip3 and easy_install3 respectively if you’re using Python 3).

$ easy_install beautifulsoup4

$ pip install beautifulsoup4

(The BeautifulSoup package is probably not what you want. That’s the previous major release, Beautiful Soup 3. Lots of software uses BS3, so it’s still available, but if you’re writing new code you should install beautifulsoup4.)

If you don’t have easy_install or pip installed, you can download the Beautiful Soup 4 source tarball and install it with setup.py.

$ python setup.py install

If all else fails, the license for Beautiful Soup allows you to package the entire library with your application. You can download the tarball, copy its bs4 directory into your application’s codebase, and use Beautiful Soup without installing it at all.

I use Python 2.7 and Python 3.2 to develop Beautiful Soup, but it should work with other recent versions.

Problems after installation¶

Beautiful Soup is packaged as Python 2 code. When you install it for use with Python 3, it’s automatically converted to Python 3 code. If you don’t install the package, the code won’t be converted. There have also been reports on Windows machines of the wrong version being installed.

If you get the ImportError “No module named HTMLParser”, your problem is that you’re running the Python 2 version of the code under Python 3.

If you get the ImportError “No module named html.parser”, your problem is that you’re running the Python 3 version of the code under Python 2.

In both cases, your best bet is to completely remove the Beautiful Soup installation from your system (including any directory created when you unzipped the tarball) and try the installation again.

If you get the SyntaxError “Invalid syntax” on the line ROOT_TAG_NAME = u'[document]', you need to convert the Python 2 code to Python 3. You can do this either by installing the package:

$ python3 setup.py install

or by manually running Python’s 2to3 conversion script on the bs4 directory:

$ 2to3-3.2 -w bs4
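When an installation misbehaves, it helps to confirm which interpreter is running and which copy of bs4 it actually imports. Here’s a small sketch of that check; the helper name describe_install is hypothetical, and it assumes the import itself succeeds (if it doesn’t, the ImportError messages above tell you more).

```python
import sys

import bs4


def describe_install():
    """Collect the facts most useful when debugging an installation.

    Hypothetical helper: it only reads attributes that every module
    has (__file__) plus bs4.__version__.
    """
    return {
        "python": "%d.%d" % sys.version_info[:2],
        "bs4_version": bs4.__version__,
        "bs4_location": bs4.__file__,
    }


info = describe_install()
for key in sorted(info):
    print(key, "=", info[key])
```

If bs4_location points into an unzipped tarball directory rather than your site-packages, you’re probably importing the unconverted source.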

Installing a parser¶

Beautiful Soup supports the HTML parser included in Python’s standard library, but it also supports a number of third-party Python parsers. One is the lxml parser. Depending on your setup, you might install lxml with one of these commands:

$ apt-get install python-lxml

$ easy_install lxml

$ pip install lxml

Another alternative is the pure-Python html5lib parser, which parses HTML the way a web browser does. Depending on your setup, you might install html5lib with one of these commands:

$ apt-get install python-html5lib

$ easy_install html5lib

$ pip install html5lib

This table summarizes the advantages and disadvantages of each parser library:

| Parser | Typical usage | Advantages | Disadvantages |
|---|---|---|---|
| Python’s html.parser | BeautifulSoup(markup, "html.parser") | Batteries included; decent speed; lenient (as of Python 2.7.3 and 3.2) | Not as fast as lxml; less lenient than html5lib |
| lxml’s HTML parser | BeautifulSoup(markup, "lxml") | Very fast; lenient | External C dependency |
| lxml’s XML parser | BeautifulSoup(markup, "lxml-xml") or BeautifulSoup(markup, "xml") | Very fast; the only currently supported XML parser | External C dependency |
| html5lib | BeautifulSoup(markup, "html5lib") | Extremely lenient; parses pages the same way a web browser does; creates valid HTML5 | Very slow; external Python dependency |

If you can, I recommend you install and use lxml for speed. If you’re using a version of Python 2 earlier than 2.7.3, or a version of Python 3 earlier than 3.2.2, it’s essential that you install lxml or html5lib, because Python’s built-in HTML parser is just not very good in older versions.

Note that if a document is invalid, different parsers will generate different Beautiful Soup trees for it. See Differences between parsers for details.
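To see the parser differences for yourself, you can run one invalid snippet through every parser you happen to have installed. This is only a sketch: the helper name parse_with_each and the sample markup are mine, and it relies on Beautiful Soup raising FeatureNotFound when you ask for a parser that isn’t installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Invalid markup: a closing </p> with no opening <p>.
INVALID = "<a></p>"


def parse_with_each(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Parse the same markup with each parser (hypothetical helper).

    Returns a dict mapping parser name to the resulting document as a
    string, or None when that parser is not installed.
    """
    results = {}
    for name in parsers:
        try:
            results[name] = str(BeautifulSoup(markup, name))
        except FeatureNotFound:
            results[name] = None  # parser not available on this system
    return results


results = parse_with_each(INVALID)
for name in sorted(results):
    print(name, "->", results[name])
```

The built-in parser simply drops the stray closing tag; the third-party parsers, if present, repair the document in their own ways.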

Making the soup¶

To parse a document, pass it into the BeautifulSoup constructor. You can pass in a string or an open filehandle:

from bs4 import BeautifulSoup

with open("index.html") as fp:
    soup = BeautifulSoup(fp)

soup = BeautifulSoup("<html>data</html>")

First, the document is converted to Unicode, and HTML entities are converted to Unicode characters:

BeautifulSoup("Sacr&eacute; bleu!")
# <html><head></head><body>Sacré bleu!</body></html>

Beautiful Soup then parses the document using the best available parser. It will use an HTML parser unless you specifically tell it to use an XML parser. (See Parsing XML.)
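Here’s a short sketch of both conversions, from a string containing an entity and from raw bytes. I name 'html.parser' explicitly so the result doesn’t depend on which parsers are installed, and I pass from_encoding to tell Beautiful Soup the byte encoding up front; the sample markup is made up for illustration.

```python
from bs4 import BeautifulSoup

# A string containing an HTML entity.
markup = "<p>Sacr&eacute; bleu!</p>"
soup = BeautifulSoup(markup, "html.parser")
text = soup.p.string  # the entity has become a real Unicode character
print(text)

# The same document as UTF-8 bytes, e.g. as read from a file opened in
# binary mode. from_encoding skips the encoding guesswork.
raw = u"<p>Sacr\u00e9 bleu!</p>".encode("utf-8")
soup_from_bytes = BeautifulSoup(raw, "html.parser", from_encoding="utf-8")
print(soup_from_bytes.p.string)
```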

Kinds of objects¶

Beautiful Soup transforms a complex HTML document into a complex tree of Python objects. But you’ll only ever have to deal with about four kinds of objects: Tag, NavigableString, BeautifulSoup, and Comment.
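As a quick orientation, this sketch parses a tiny made-up document that contains one of each and prints the class of every object. The markup is my own example; the classes are imported from bs4.element.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

soup = BeautifulSoup("<b><!--a note-->bold text</b>", "html.parser")
tag = soup.b

# The <b> tag has two children: the comment, then the text.
comment, text = tag.contents

kinds = [type(obj).__name__ for obj in (soup, tag, text, comment)]
print(kinds)

# Comment is a special kind of NavigableString.
print(isinstance(comment, NavigableString))
```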

Tag¶

A Tag object corresponds to an XML or HTML tag in the original document:

soup = BeautifulSoup('<b id="boldest">Extremely bold</b>')
tag = soup.b
type(tag)
# <class 'bs4.element.Tag'>

Tags have a lot of attributes and methods, and I’ll cover most of them in Navigating the tree and Searching the tree. For now, the most important features of a tag are its name and attributes.

Name¶

Every tag has a name, accessible as .name:

tag.name
# u'b'

If you change a tag’s name, the change will be reflected in any HTML markup generated by Beautiful Soup:

tag.name = "blockquote"
tag
# <blockquote id="boldest">Extremely bold</blockquote>

Attributes¶

A tag may have any number of attributes. The tag <b id="boldest"> has an attribute “id” whose value is “boldest”. You can access a tag’s attributes by treating the tag like a dictionary:

tag['id']
# u'boldest'

You can access that dictionary directly as .attrs:

tag.attrs
# {u'id': 'boldest'}
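Because a missing key raises KeyError, code that reads attributes from real-world markup usually wants a fallback. Here’s a sketch of the three ways to read an attribute; the helper name attribute_or_default is hypothetical and just wraps Tag.get().

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b id="boldest">Extremely bold</b>', "html.parser")
tag = soup.b


def attribute_or_default(tag, name, default=None):
    """Read an attribute without risking KeyError (hypothetical helper)."""
    return tag.get(name, default)


present = attribute_or_default(tag, "id")
missing = attribute_or_default(tag, "title", "n/a")
print(present, missing)

# Dictionary-style access and .attrs see the same data.
print(tag["id"] == tag.attrs["id"])
print(tag.has_attr("title"))
```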

You can add, remove, and modify a tag’s attributes. Again, this is done by treating the tag as a dictionary:

tag['id'] = 'verybold'
tag['another-attribute'] = 1
tag
# <blockquote another-attribute="1" id="verybold">Extremely bold</blockquote>

del tag['id']
del tag['another-attribute']
tag
# <blockquote>Extremely bold</blockquote>

tag['id']
# KeyError: 'id'
print(tag.get('id'))
# None

Multi-valued attributes¶

HTML 4 defines a few attributes that can have multiple values. HTML 5 removes a couple of them, but defines a few more. The most common multi-valued attribute is class (that is, a tag can have more than one CSS class). Others include rel, rev, accept-charset, headers, and accesskey. Beautiful Soup presents the value(s) of a multi-valued attribute as a list:

css_soup = BeautifulSoup('<p class="body"></p>')
css_soup.p['class']
# ["body"]

css_soup = BeautifulSoup('<p class="body strikeout"></p>')
css_soup.p['class']
# ["body", "strikeout"]

If an attribute looks like it has more than one value, but it’s not a multi-valued attribute as defined by any version of the HTML standard, Beautiful Soup will leave the attribute alone:

id_soup = BeautifulSoup('<p id="my id"></p>')
id_soup.p['id']
# 'my id'

When you turn a tag back into a string, multiple attribute values are consolidated:

rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>')
rel_soup.a['rel']
# ['index']
rel_soup.a['rel'] = ['index', 'contents']
print(rel_soup.p)
# <p>Back to the <a rel="index contents">homepage</a></p>

You can disable this by passing multi_valued_attributes=None as a keyword argument into the BeautifulSoup constructor:

no_list_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html', multi_valued_attributes=None)
no_list_soup.p['class']
# u'body strikeout'

You can use get_attribute_list to get a value that’s always a list, whether or not it’s a multi-valued attribute:

id_soup.p.get_attribute_list('id')
# ["my id"]
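The following sketch puts those behaviours side by side on one tag. The markup is a made-up example that carries both a class and an id containing a space, and 'html.parser' is named explicitly so the result doesn’t depend on installed parsers.

```python
from bs4 import BeautifulSoup

markup = '<p class="body strikeout" id="my id"></p>'

# Default behaviour: class is split into a list, id is left alone.
default_soup = BeautifulSoup(markup, "html.parser")
classes = default_soup.p["class"]
p_id = default_soup.p["id"]

# get_attribute_list() always hands back a list.
id_list = default_soup.p.get_attribute_list("id")

# With multi_valued_attributes=None, nothing is split.
plain_soup = BeautifulSoup(markup, "html.parser", multi_valued_attributes=None)
plain_classes = plain_soup.p["class"]

print(classes, p_id, id_list, plain_classes)
```

If your code loops over attribute values, get_attribute_list() saves you from checking whether you got a string or a list.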

If you parse a document as XML, there are no multi-valued attributes:

xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml')
xml_soup.p['class']
# u'body strikeout'

Again, you can configure this using the multi_valued_attributes argument:

class_is_multi = {'*': 'class'}
xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml', multi_valued_attributes=class_is_multi)
xml_soup.p['class']
# [u'body', u'strikeout']

You probably won’t need to do this, but if you do, use the defaults as a guide. They implement the rules described in the HTML specification:

from bs4.builder import builder_registry
builder_registry.lookup('html').DEFAULT_CDATA_LIST_ATTRIBUTES

NavigableString¶

A string corresponds to a bit of text within a tag. Beautiful Soup uses the NavigableString class to contain these bits of text:

tag.string
# u'Extremely bold'
type(tag.string)
# <class 'bs4.element.NavigableString'>

A NavigableString is just like a Python Unicode string, except that it also supports some of the features described in Navigating the tree and Searching the tree. You can convert a NavigableString to a Unicode string with unicode() (in Python 3, use str() instead):

unicode_string = unicode(tag.string)
unicode_string
# u'Extremely bold'
type(unicode_string)
# <type 'unicode'>

You can’t edit a string in place, but you can replace one string with another, using replace_with():

tag.string.replace_with("No longer bold")
tag
# <blockquote>No longer bold</blockquote>

NavigableString supports most of the features described in Navigating the tree and Searching the tree, but not all of them. In particular, since a string can’t contain anything (the way a tag may contain a string or another tag), strings don’t support the .contents or .string attributes, or the find() method.

If you want to use a NavigableString outside of Beautiful Soup, you should call unicode() on it (str() in Python 3) to turn it into a normal Python Unicode string. If you don’t, your string will carry around a reference to the entire Beautiful Soup parse tree, even when you’re done using Beautiful Soup. This is a big waste of memory.
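Here’s a Python 3 sketch of that conversion, assuming str() as the counterpart of unicode(). It shows that the NavigableString still points back into the tree while the plain copy doesn’t, and then replaces the string with replace_with(). The markup is a made-up example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b>Extremely bold</b>", "html.parser")

navigable = soup.b.string
plain = str(navigable)  # Python 3; use unicode(navigable) in Python 2

# The NavigableString keeps a link to its parent tag (and so to the
# whole tree); the plain string is just text.
print(type(navigable).__name__, navigable.parent.name)
print(type(plain).__name__, hasattr(plain, "parent"))

# Strings can't be edited in place, only swapped for a new one.
navigable.replace_with("No longer bold")
print(soup.b)
```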

BeautifulSoup¶

The BeautifulSoup object represents the parsed document as a whole. For most purposes, you can treat it as a Tag object. This means it supports most of the methods described in Navigating the tree and Searching the tree.
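A brief sketch of what “treat it as a Tag” means in practice, using a made-up two-paragraph document: the soup object passes an isinstance check against Tag and answers find_all() like any other tag, although its name is the special value '[document]' because it doesn’t correspond to a real tag.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup("<p>one</p><p>two</p>", "html.parser")

is_tag = isinstance(soup, Tag)
paragraphs = [p.string for p in soup.find_all("p")]
root_name = soup.name

print(is_tag, paragraphs, root_name)
```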

You can also pass a BeautifulSoup object into one of the methods defined in Modifying the tree, just as you would a Tag. This lets you do things like combine two parsed documents:

doc = BeautifulSoup("INSERT FOOTER HERE
